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Abstract 

The  goal  of  High  Performance  Fortran  (HPF)  is  to  ^^address  the  problems  of  writing  data  parallel 
programs  where  the  distribution  of  data  affects  performance*^  providing  the  user  with  a  high-level 
language  interface  for  programming  scalable  parallel  architectures  and  delegating  to  the  compiler  the 
task  of  producing  an  explicitly  parallel  message-passing  program.  For  some  applications,  this  ap¬ 
proach  may  result  in  dramatic  performance  losses.  An  important  example  is  the  inspector/executor 
paradigm,  which  HPF  uses  to  support  irregular  daia  accesses  in  parallel  loops.  In  many  cases, 
the  compiler  does  not  have  sufficient  information  to  decide  whether  an  inspector  computation  is 
redundant  or  needs  to  be  repeated.  In  such  cases,  the  performance  of  the  whole  program  may  be 
significantly  degraded. 

In  this  paper,  we  describe  an  approach  to  solve  this  problem  through  the  introduction  of  con¬ 
structs  allowing  explicit  manipulation  of  communication  schedules  at  the  HPF  language  level.  The 
goal  is  to  avoid  the  use  of  EX  TRIMS  ICS  for  expressing  irregular  computation  via  message-passing 
primitives,  while  guaranteeing  essentially  the  same  performance.  These  language  features  allow 
the  user  to  control  the  reuse  of  schedules  and  to  specify  access  patterns  that  map  be  used  to  com¬ 
pute  a  schedule.  They  are  being  implemented  as  part  of  the  HPF-j-  language  and  we  report  som.e 
preliminary  performance  numbers  from  this  implementation. 
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1  Introduction 


The  High  Performance  Fortran  Forum  (HPFF),  which  first  convened  during  1992,  set  itself  the  task 
of  defining  language  extensions  for  Fortran  to  facilitate  data  parallel  programming  on  a  wide  range 
of  parallel  architectures,  without  sacrificing  performance.  Much  of  the  work  on  HPF-1  [11]  focussed 
on  extending  Fortran  90  by  directives  for  specifying  alignment  and  regular  data  distributions  (block 
and  cyclic).  Other  extensions  allow  the  specification  of  explicitly  parallel  loops,  in  particular 
with  the  FORALL  statement  and  construct  and  the  INDEPENDENT  directive,  as  well  as  a 
number  of  library  routines.  It  soon  became  apparent  that  HPF-1  did  not  provide  enough  flexibility 
for  an  eflficient  formulation  of  many  advanced  algorithms.  Such  algorithms,  for  example  weather 
forecasting  codes  or  crash  simulations,  are  often  characterized  by  the  need  to  distribute  data  in  an 
irregular  manner  and  dynamically  balance  the  computational  load  of  the  processors.  HPF-2  [12], 
together  with  its  “approved  extensions”,  is  a  significant  step  in  supporting  such  algorithms. 

HPF  directives  provide,  at  a  high  level  of  abstraction,  information  useful  to  the  compiler  in  the 
context  of  a  single-threaded  data  parallel  paradigm  with  a  global  address  space.  It  is  the  respon¬ 
sibility  of  the  compiler  to  translate  a  program  containing  such  directives  into  an  efiicient  parallel 
Single-Program-Multiple-Data  (SPMD)  target  code  using  explicit  constructs  for  data-sharing,  such 
as  message-passing  on  distributed  memory  machines.  The  study  of  real  applications  has,  however, 
revealed  that  even  with  the  enhanced  data  and  work  distributions  offered  by  the  latest  version  of  the 
language,  certain  “low-level”  information  that  may  decisively  influence  a  program’s  performance 
cannot  be  expressed. 

In  this  paper  we  focus  on  one  such  example  —  the  efficiency  of  the  communication  required  to 

handle  irregular  data  accesses  in  parallel  loops.  Most  compilers  generate  code  which  at  runtime 
determines  a  communication  schedule  to  handle  the  gathering  and  scattering  of  nonlocal  data 
required  for  executing  the  parallel  loop.  The  computation  of  such  a  schedule,  based  on  a  runtime 
analysis  of  the  access  patterns  to  an  array,  is  performed  by  a  routine  called  an  inspector.  Inspectors 
may  be  very  time-consuming;  as  a  consequence,  the  avoidance  of  redundant  inspector  executions 
is  a  high-priority  goal. 

In  this  paper,  we  support  this  goal  by  proposing  a  method  which  allows  control  of  schedule 
computations  at  the  HPF  language  level.  Our  method  is  based  on  the  concept  of  schedule  varictbles, 
whose  values  axe  communication  schedules  computed  by  an  inspector  or,  alternatively,  defined  by 
the  user  by  specifying  the  access  pattern  to  an  array  with  appropriate  high-level  directives. 

Thus,  using  these  features  the  programmer  is  not  forced  to  switch  to  the  message-passing 
paradigm  via  the  EXTRINSICS  facility  of  HPF,  while  obtaining  basically  the  same  performance 
as  with  message  passing.  The  concepts  described  here  are  applicable  to  any  HPF-like  language; 
they  are  currently  being  implemented  as  part  of  the  language  HPF+,  which  was  initially  specified  in 
[5]  and  is  currently  being  developed  into  a  fully-specified  Fortran  95-ba8ed  programming  language 
in  the  ESPRIT  Long  Term  Research  Project  “HPF-I-” .  The  required  compiler  and  runtime  support 
is  being  implemented  as  a  part  of  the  Vienna  Fortran  Compilation  System  (VFCS)  [2]. 

The  paper  is  organized  as  follows.  The  next  section  providas  a  more  detailed  discussion  of 
the  inspector-executor  paradigm  and  additional  motivation  for  our  work.  Section  3  introduces  the 
concepts  and  terminology  required  to  describe  schedules  and  access  patterns.  Section  4  introduces 
the  new  features  to  control  the  generation  of  schedules  through  a  series  of  small  examples  while  a 


1 


more  complete  example  is  given  in  Section  5.  Section  6  describes  how  users  can  specify  the  access 
patterns  for  a  loop.  Section  7  discusses  the  implementation  of  these  features  and  provides  some 
details  of  the  performance  we  have  gained  when  using  these  concepts  in  a  compiler.  The  paper 
ends  with  an  overview  of  related  work  (Section  8)  and  a  short  conclusion. 

2  Motivation 

In  many  cases  -  in  particular,  for  regular  numerical  computations  operating  on  dense  data  structures 
-  an  HPF  compiler  can  statically  determine  the  access  patterns  for  arrays  in  the  source  program  and, 
from  this  information,  derive  the  required  communication  in  the  parallel  target  program.  This  is  not 
possible  for  irregular  problems,  in  which  arrays  are  accessed  via  index  arrays  defined  dynamically. 
For  such  problems,  the  access  patterns  as  well  as  the  associated  communication  schedules  must 
be  computed  at  rimtime.  The  standard  implementation  technology  for  this  situation  is  called  the 
inspector- executor  paradigm  [13,  16,  17].  That  is,  for  a  parallel  loop  with  indirect  data  accesses, 
an  inspector  analyzes  the  data  access  pattern  at  runtime.  From  the  access  pattern,  the  array 
distribution,  and  the  work  distribution  of  the  loop,  a  communication  schedule  can  be  derived.  This 
schedule  is  then  used  in  the  executor  to  actually  perform  the  required  communication  for  the  loop 
by  gathering  all  nonlocal  data  to  be  read  in  the  loop  at  the  beginning,  and  scattering  all  nonlocal 
data  that  were  written  in  the  loop,  at  the  end. 

The  inspector  phase  for  a  parallel  loop  can  be  very  expensive  and  may  dominate  the  execution 
time  for  a  whole  program.  However,  the  parallel  loop  is  often  enclosed  in  a  sequential  loop  (for 
example,  a  time-step  loop),  and  its  communication  schedule  may  be  invariant  over  many  or  all  of  the 
iterations  of  the  sequential  loop.  The  recognition  of  invariant  communication  schedules  is  crucial 
for  obtaining  high  object  code  performance.  In  some  cases,  a  compiler  can  recognize  this  invariance 
automatically.  However,  often  such  loop  structures  are  hidden  in  a  hierarchy  of  procedure  calls,  so 
that  even  sophisticated  interprocedural  analysis  may  be  unable  to  achieve  this  goal. 

The  main  motivation  for  the  language  extensions  described  in  this  paper  is  to  allow  the  user 
explicit  control  over  the  generation  of  communication  schedules,  thus  avoiding  the  redundant  com¬ 
putation  of  communication  schedules  by  an  inspector.  We  introduce  schedule  functions  as  mappings 
from  a  processor  to  sets  of  array  indices,  describing  the  set  of  array  elements  that  have  to  be  com¬ 
municated  (read  or  written)  to  or  from  each  processor.  By  combining  schedule  functions  with 
a  specific  array  and  a  direction  (read  or  write),  we  obtain  an  array  schedule,  which  is  a  precise 
specification  of  a  gather  or  scatter  communication  for  that  array.  We  also  introduce  the  concept 
of  a  schedule  variable  which  can  be  used  to  name  the  schedule  computed  by  an  inspector.  Using 
schedule  variables,  the  user  can  explicitly  control  the  reuse  of  schedules  by  directing  the  compiler  to 
skip  inspector  analysis  if  a  related  schedule  variable  already  contains  the  communication  schedule 
required  for  executing  the  loop. 

For  parallel  loops  with  complex  bodies,  possibly  including  nested  loops,  conditional  statements 
and  procedure  calls,  the  inspector  computation  can  become  highly  complex.  If,  in  such  a  situation, 
the  user  is  in  a  position  to  specify  the  access  pattern  of  a  loop,  the  inspector  computation  can 
be  replaced  by  an  evaluation  of  that  pattern  in  the  context  of  the  loop.  Our  language  extensions 
support  this  feature  by  introducing  the  concept  of  a  pattern  function  -  a  mapping  from  a  loop 
iteration  range  to  the  powerset  of  an  array  index  domain.  A  pattern  function  can  be  combined 
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with  a  specific  loop  and  a  specific  array,  to  provide  an  array  loop  pattern  which  can  be  directly 
mapped  to  an  array  schedule.  We  provide  a  mechanism  for  the  specification  of  simple  patterns 
based  on  a  generalization  of  Fortran  90  section  notation,  taking  advantage  of  the  fact  that  vector 
subscripts  can  be  used  to  describe  irregular  access  patterns.  For  more  general  cases,  for  example 
when  access  patterns  depend  on  nested  conditionals  and  loops,  the  absence  of  set  types  in  Fortran 
enforces  new  syntax  and  semantics.  We  are  currently  studying  mechanisms  based  on  the  concept  of 
user-defined  pattern  junctions,  which  allow  the  specification  of  arbitrary  patterns,  and  will  report 
on  such  mechanisms  in  a  future  paper. 

3  Schedules  and  Patterns:  Concepts  and  Terminology 

In  this  section,  we  introduce  the  concepts  and  terminology  required  for  a  discussion  of  communica¬ 
tion  schedules  and  access  patterns.  Assume  for  the  following  that  A  is  an  array  with  index  domain 
I,  P  denotes  a  set  of  processors,  and  L  denotes  a  perfectly  nested  loop  with  iteration  domain  R. 

3.1  Data  and  Work  Distributions 
Definition  1  Data  Distribution 

1.  A  (replication-free)  (data)  distribution  for  A  with  respect  to  F  is  a  total  function  6"^  :1 

P. 

2.  Let  A,  6^  and  p  6  P  6e  given.  Then  \^{p)  :=  {i  e  1 1  5^(i)  =  p)  determines  the  set 

of  elements  of  A  owned  by  processor  p  under  distribution  .  This  is  called  the  distribution 
segment  of  A  with  respect  to  p,  its  elements  are  the  local  variables  of  A  in  processor  p.  □ 

Definition  2  Work  Distribution 

1.  A  work  distribution  for  L  is  a  total  function  :  R  P.  For  each  r  S  R,  a)^(r)  specifies 
the  processor  in  which  iteration  r  is  to  be  executed. 

2.  Let  L,  a  work  distribution  andp  e  P  6e  given.  Then  X^ip)  denotes  the  set  of  all  iterations 
of  L  to  be  executed  in  p:  x^{p)  :=  {r  €  R  |  w^(r)  =  p).  x^{p)  caZZed  the  execution  set 
ofp  with  respect  to  L. 

3.2  Schedules 

Definition  3  Schedule  Factions  and  Array  Schedules 

1.  Let  I  denote  an  array  index  domain.  A  schedule  function  is  a  total  function  a  :  P  ^  ■p(I), 
where  VlfL)  designates  the  powerset  (set  of  all  subsets)  of  the  index  domain  I. 

2.  An  array  schedule  for  array  A  with  distribution  segment  is  a  triplet  {A,a,d),  where 

(a)  a  is  a  schedule  function  with  range  'P{1) 

(b)  d  e  {i?,  W}  specifies  a  direction:  read  (“R”),  or  write)  (“W”).  If  d=“R”,  then  the 
array  schedule  is  called  a  read  or  gather  schedule,  otherwise  a  write  or  scatter  schedule. 

(c)  for  each  p  €  P  :  ofp)  fl  A'^(p)  =  0. 
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Schedule  Application 

Let  a  =  (A,  <7,  R)  denote  a  read  schedule.  Then  the  application  of  a  results  for  each  processor  p  in 

•  receiving  all  elements  A(i)  with  i  6  o{p).  By  definition,  these  elements  are  nonlocal  with 
respect  to  p. 

t  sending  all  elements  A(i)  local  to  p  to  all  processors  p'  such  that  i  e  o{p'). 

Each  processor  p  must  establish  a  set  of  buffer  elements  in  its  local  memory  for  storing  the 
nonlocal  objects  specified  in  o{p)- 

Let  p,p'  denote  two  arbitrary  distinct  processors.  Then  we  denote  by  REGEIVE(p,p’)  the  set  of 
elements  of  A  to  be  received  in  p  from  p',  and  by  SEND(p,p’)~RECEIVE(p’,p)  the  set  of  elements 
of  A  to  be  sent  from  p  to  p'.  Based  on  the  above  read  sdiedule  a,  the  receive  sets  can  be  immedi¬ 
ately  specified  as  RECEIVE(p,p’):=a(p)  n  A^(p').  Due  to  the  symmetry  of  the  two  classes  of  sets, 
we  can  then  determine  all  send  sets  after  a  global  communication  phase. 

The  application  of  a  write  schedule  is  similar.  Note  that  write  schedules  are  only  relevant  if  a 
work  distribution  different  from  the  owner  computes  paradigm  [20]  is  used. 

3.3  Patterns 

Schedules,  as  discussed  above,  are  usually  determined  based  on  an  inspector  analysis  of  array  access 
patterns  in  a  loop.  In  this  section,  we  formalize  the  concepts  related  to  patterns  and  specify  the 
mapping  from  patterns  to  schedules. 

Definition  4  Pattern  Functions  and  Array  Loop  Patterns 

1.  Let  R  denote  a  loop  iteration  domain,  and  I  an  array  index  domain.  A  pattern  function 
is  a  total  function  rr  :  R  — +  ■P(I). 

2.  An  array  loop  pattern  (AXF)  for  array  A  and  loop  L  with  direction  d  is  a  quadruplet 
(L,  A,7r,  d),  where  ir  is  a  pattern  function  mapping  the  loop  iteration  domain  of  L  to  the  in¬ 
dex  domain  associated  with  A.  An  ALP  is  respectively  called  a  read  pattern  or  a  write  pattern, 
depending  on  whether  d  =  “i?"  or  d=  “W". 


A  pattern  function  describes  an  array  access  pattern  for  the  iterations  of  a  parallel  loop.  More 
specifically,  it  expresses  the  fact  that,  for  each  r  6  R,  iteration  r  accesses  the  elements  with  indices 
i  e  7r(r)  for  some  array.  Note  that  tt  does  not  depend  on  a  particular  array  or  loop  -  it  refers  only 
to  the  associated  index  and  iteration  domains  -  and  is  also  independent  of  related  data  and  work 
distributions. 

When  we  bind  a  pattern  function  to  a  specific  array  (with  a  well-defined  distribution),  a  specific 
loop  (with  a  well-defined  work  distribution),  and  a  direction  we  obtain  an  ALP.  R:om  a  given  ALP, 

an  array  schedule  can  be  determined  according  to  the  following  lemma: 

Lemma  1  Mapping  ALPs  to  Arra^  Schedules 

Let  (L,A,7r,  d)  denote  an  ALP.  The  corresponding  array  schedule  is  given  by  (A,o-,d),  where  a  = 
p2s{uj^,S^,'k)  is  defined  as 
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<^(P)  ■=  Ur€x(p)  ~  ^(P')  ^ 

□ 

As  can  be  seen  from  the  construction  used  in  the  above  lemma,  the  array  schedule  for  a  given 
ALP  only  depends  on  the  associated  data  and  work  distributions.  Two  ALPs  {L,A,'ir,d)  and 
{L',A',Tv',d')  are  called  equivalent  iff 

•  the  loop  iteration  domains  and  work  distributions  of  L  and  L'  are  identical, 

•  the  index  domains  and  data  distributions  of  A  and  A'  are  identical,  and 

•  their  pattern  functions,  tt  and  tt' ,  are  identical. 


4  Language  Features  for  Schedule  Control 

In  this  section,  we  describe  the  syntax  and  semantics  of  language  extensions  for  the  exphcit  control 
of  schedules.  The  main  objective  for  the  introduction  of  these  features  is  to  provide  the  user  with 
a  means  to  assert  to  the  compiler /runtime  system  that  a  schedule  computation  is  redundant  and 
can  be  suppressed,  reusing  a  previously  computed  schedule. 

4.1  Schedule  Variables  and  Their  Vedues 

Schedule  variables  are  declared  using  a  SCHEDULE  directive,  which  must  occur  in  the  speci¬ 
fication  part  of  a  program  unit. 

IHPF+  SCHEDULE::  S 

Schedule  variables  can  be  organized  into  arrays  and  appear  as  components  of  derived  types.  At  any 
time  within  its  scope,  a  schedule  variable  is  either  undefined  or  has  a  well-defined  value.  Initially  - 
immediately  after  processing  its  declaration  (or,  if  a  SAVE  attribute  is  specified,  after  processing 
the  first  instance  of  the  declaration)  -  a  schedule  variable  is  set  to  undefined.  Another  way  to  set 
a  schedule  variable  to  undefined  is  to  apply  the  RESET  directive  to  the  variable. 

At  any  time,  a  schedule  variable  is  associated  with  a  (possibly  empty)  set  of  array  schedules, 
{(Ai,£r,di),  . . . ,  {An,cr,dn)},  n  >  0.  If  the  schedule  variable  is  defined,  i.e.  n  >  0,  then  all  Ai  have 
the  same  index  domain,  I,  the  same  data  distribution,  S,  and  the  same  communication  behavior, 
as  specified  by  a.  We  call  the  triplet  (I,  S,  cr)  the  core  of  the  value  bound  to  the  schedule  variable. 

A  given  array  may  occur  in  two  array  schedules  associated  with  a  schedule  variable,  if  it  uses  a 
gather  and  a  scatter  schedule  with  the  same  schedule  function. 

If  we  discuss  the  reuse  of  a  schedule  via.  a  schedule  variable  with  a  defined  value,  then  only  its 
core  is  relevant:  the  schedule  variable  may  be  associated  with  a  new  set  of  arrays  and  arbitrary 
directions,  as  long  as  its  core  remains  the  same. 

In  the  following,  we  will  often  characterize  the  value  of  a  schedule  variable  by  a  set  of  equivalent 
ALPs  rather  than  by  explicitly  specifying  it  in  the  above  sense.  Since,  by  Lemma  1,  these  ALPs 
uniquely  determine  a  core  as  well  as  a  list  of  associated  array  schedules,  this  does  not  restrict  the 
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generality  of  the  discussion. 


Our  language  extensions  provide  multiple  ways  for  specifying  the  value  of  a  schedule  variable. 
The  most  important  method  is  to  use  the  result  of  the  (implicit)  inspector  analysis  performed  for 
a  parallel  loop.  Another  method  is  schedule  assignment,  which,  in  a  way  similar  to  conventional 
assignment,  binds  the  result  of  a  schedule  expression  to  a  schedule  variable.  Furthermore,  there  are 
methods  for  the  explicit  specification  of  patterns,  discussed  in  Section  6* 

There  is  one  particular  context  for  a  schedule  variable,  established  by  a  parallel  loop,  which 
has  a  special  semantics.  We  call  this  a  def-use  context.  If  a  schedule  variable  occurring  in  such 
a  context  is  undefined,  then  a  value  for  the  variable  will  be  defined  (for  example,  by  executing 
an  inspector  or  evaluating  a  pattern  specification).  If,  on  the  other  hand,  the  schedule  variable  is 
already  defined,  then  the  schedule  computation  will  be  suppressed  and  the  schedule  to  which  the 
variable  is  bound  will  be  applied.  The  occurrence  of  a  schedule  variable  in  a  def-use  context  is 
semantically  correct  only  if  the  associated  schedule  function  specifies  precisely  the  communication 
that  has  to  be  performed. 

For  certain  simple  cases,  the  user  can  control  the  redundancy  of  a  schedule  computation  without 
explicitly  introducing  schedule  variables.  As  described  in  the  next  subsection,  we  provide  a  single 
keyword,  REUSE ,  to  support  this  functionality. 

In  the  next  few  subsections,  we  use  a  series  of  examples  to  show  how  users  can  control  the  com¬ 
putation  of  the  schedule  in  various  situations.  We  always  assume  here  that  the  original  computation 
of  a  schedule  is  performed  by  an  inspector. 

4.2  Unnamed  Schedules 

In  some  situations,  the  implementation  can  be  directed  to  reuse  a  set  of  schedules  without  the 
need  to  exphcitly  introduce  schedule  variables  for  that  purpose.  In  particular,  if  a  schedule  is  to 
be  reused  with  the  same  loop  being  called  repeatedly  and  is  not  used  with  any  other  loop  the  user 
does  not  need  to  explicitly  name  the  schedule  and  can  control  the  reuse  of  the  schedule  for  the  loop 
by  a  reuse  clause.  In  this  case,  all  communication  schedules  required  for  the  loop  are  treated  as  a 
single  unit  and  are  subject  to  the  reuse  semantics. 

Example  1  Unconditional  Schedule  Reuse  Without  Schedule  Variable 

DO  t=l,maxTime 

!HPF+  INDEPENDENT,  ON  HOME(Y(I)),  REUSE 
L:  DO  1=1, N 

X(IX1(I))  =  X(IX2(I))-fY(I)*Y(I)/(Z(IXl(I))-U(IXl(I), 1X2(1))) 

END  DO 
END  DO 

In  the  above  code,  loop  L,  is  executed  max_times.  By  specifying  the  REUSE  attribute  in  the 
INDEPENDENT  directive,  the  user  is  indicaMng  thai  the  access  paMems  for  all  arrays  are  not 
going  to  change  across  multiple  executions  of  the  loops.  That  is,  the  values  of  the  arrays  used  as 
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index  vectors,  IX 1  and  1X2,  along  with  the  distributions  of  the  arrays  X,  Y  and  Z  remain  invariant 
during  the  execution  of  the  outer  loop.  Thus,  loop  L,  when  entered  first,  will  activate  inspector 
analysis  to  determine  the  read  patterns  for  X,  Y,  Z,  and  U,  and  the  write  pattern  for  X.  For  all 
subsequent  executions,  the  schedules  determined  during  tiie  first  execution  will  be  reused.  □ 


In  our  experience,  many  codes  which  use  indirection  vectors  to  access  arrays  do  not  change  the 
values  of  the  vectors  during  execution  once  they  have  been  initialized.  For  example,  an  unstructured 
grid  code  which  is  non-adaptive  will  read  in  the  grid  points  and  set  \ip  the  grid  interconnections 
in  the  initialization  phase  of  the  program.  Since  the  grid  does  not  change  during  the  execution  of 
the  program  these  interconnections  and  consequently  the  index  vectors  representing  the  neighbors 
remain  invariant  throughout  the  program.  In  such  situations,  the  REUSE  construct  provides  the 
user  with  a  simple  mechanism  to  assert  this  fact  to  the  compiler,  which  then  can  organize  the  reuse 
of  the  communication  schedules  once  computed  for  the  loop.  Note  that  the  directive  will  work  even 
if  the  outer  loop  is  in  one  procedure  which  repeatedly  calls  another  procedure  containing  the  the 
parallel  loop.  That  is,  in  the  above  code,  the  loop  L  can  be  in  a  different  procedure  which  is  called 
from  the  outer  loop. 

In  cases  when  the  index  vectors  are  changing  or  the  arrays  themselves  are  being  dynamically 
redistributed,  a  simple  generalization  of  the  REUSE  construct  allows  the  specification  of  a  con¬ 
dition  for  reuse.  This  condition  is  evaluated  whenever  a  schedule  for  the  loop  has  already  been 
computed  by  a  previous  execution.  If  it  yields  TRUE ,  the  old  schedule  is  reused;  otherwise  a  new 
schedule  is  computed. 

Example  2  Conditional  Schedule  Reuse  IVlthout  Schedule  Variable 

LOGICAL  USE-OLD 

DO  t=l, max-time 

!HPF-I-  INDEPENDENT,  ON  HOME(C(I)),  REUSE (USE.OLD) 

L:  DO  1=1, N 

■"  A(I)  =  B(IX(I))  +  C(I) 

END  DO 

USE.OLD  =  .TRUE. 

IF  RECONFIGURE(...) 

THEN  CALL  RECOMPUTE(IX);  USE-OLD=.  FALSE . 

END  IF 

END  DO 

Here,  the  logical  variable  USE.OLD  is  used  to  further  control  the  generation  of  the  schedule.  That 
is,  its  value  is  set  to  FALSE  when  the  index  vector,  IX  is  changed.  Note  that  if  the  schedule  is 
undefined,  wkick  will  be  the  case  the  first  time  that  the  loop  L  is  executed,  the  value  of  the  associated 
condition  is  ignored.  In  subsequent  executions,  the  value  of  the  condition,  the  logical  variable  here, 
will  control  the  generation  of  a  new  schedule.  C 
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4.3  Using  Schedule  Variables 

In  this  section,  we  illustrate  the  use  of  schedule  variables  for  the  specification  of  schedule  reuse. 
We  cover  conditional  and  unconditional  reuse,  and  reuse  across  different  loops. 

Schedule  variables  may  be  associated  with  arrays  in  the  context  of  a  parallel  loop  by  means  of 
the  gather  directive  and  the  scatter  directive.  Both  directives  establish  a  def-use  context. 

The  gather  directive,  in  its  simplest  form,  has  the  syntax 

GATHER(4i,...  ::  5) 

where  5  is  a  schedule  variable,  and  Ai,...,An  denote  arrays  with  identical  shapes  and  distributions, 
and  equivalent  read  patterns  in  the  loop.  This  construct  associates  S  with  the  corresponding  set 
of  equivalent  ALPs.* 

Syntax  and  semantics  for  the  scatter  directive  are  similar,  except  that  the  keyword  SCATTER 
is  used  and  the  direction  of  the  communication  is  reversed. 

No  array  may  appear  in  more  than  one  gather  or  more  than  one  scatter  directive  associated 
with  the  same  loop.  Note,  however,  that  a  given  array  may  be  associated  with  different  schedule 
functions  in  a  gather  and  scatter  schedule. 

Example  3  Unconditional  Schedule  Reuse  in  a  Loop 

!HPF+  SCHEDULE::  S 

DO  t=l, max-time 

!HPF+  INDEPENDENT,  ON  HOME(C(I)),  GATHER  (B::S) 

L:  DO  I=1,N 

A(I)  =  B(IX(I))  +  C(I) 

END  DO 
END  DO 

At  the  time  execution  reaches  the  independent  loop  first,  the  value  of  S  is  undefined.  At  that  Ume, 
the  GATHER  clause  causes  the  execution  of  the  inspector  for  the  accesses  to  B  in  the  loop,  and  the 
assignment  of  the  resulting  schedule  to  S.  On  subsequent  iterations  of  the  outer  loop,  the  schedule 
bound  to  S  is  reused  for  B. 

The  work  distribution  of  L,  as  specified  by  the  ON  clause,  is  given  by  w^(7)  =  S^{I)  for  all 
I  =  i.e.,  iteration  I  is  performed  on  the  processor  that  owns  element  I  of  array  C.  The 

ALP  for  B  is  given  as  {L,B,ir,R),  where  the  pattern  function,  x,  is  specified  as  n{I)  =  {IX{I)} 
for  all  I.  The  resulting  array  schedule  for  B  is  (B,<r,  J?),  where  (see  Lemma  1)  for  each  p  e  P  : 

a(p)  :=Ui6x-(p)/V(7)-A«(p). 

□ 

'This  association  of  S  can  be  extended  by  other  gather  or  scatter  directives  in  the  context  of  the  same  loop. 
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If  a  schedule  is  to  be  reused  across  subsequent  procediire  invocations,  it  has  to  be  declared  with  the 
save  attribute.  This  is  shown  in  the  following  example,  which  uses  a  slightly  more  general  access 
pattern:  the  pattern  function,  w,  for  B  is  now  7r(7)  =  {7X1(7),  7X2(7)}. 

Example  4  Reusing  Schedules  in  Procedures 

DO  t=l,max-time 

CALL  IRREG(B,IX1,IX2) 

END  DO 

SUBROUTESTE  IRREG(X,IX1,IX2) 

IHPF+  SCHEDULE,  SAVE  ::  S 

!HPF+  INDEPENDENT,  ON  HOME(C(I)),  GATHER  (B::S) 

L:  DO  I=1,N 

A(I)  =  B(IX1(I))  +  B(IX2(I))*C(I) 

END  DO 

END  SUBROUTINE 


Another  mechanism  for  reusing  schedules  in  procedures  is  to  declare  them  as  global  variables 
in  modules,  thus  implicitly  saving  them  between  calls.  If  schedules  became  first  class  objects  in  the 
base  language,  then  they  could  be  declared  at  an  appropriate  level  and  passed  down  the  call  chain 
similar  to  any  other  data  structure. 

By  means  of  the  reset  directive,  the  definition  status  of  a  schedule  variable  may  be  set  to 
undefined.  This  allows  the  control  of  schedule  reuse  depending  on  runtime  conditions.  We  slightly 
modify  Example  3  to  illustrate  this. 

Example  5  Conditional  Schedule  Reuse 
!HPF+  SCHEDULE::  S 
DO  t=l,max-time 

1HPF+  INDEPENDENT,  ON  HOME(C(I)),  GATHER  (B:;S) 

L:  DO  I=1,N 

A(I)  =  B(IX(I))  +  C(I) 

END  DO 

IF  RECONFIGURE(...)  THEN 
CALL  RECOMPUTE(IX) 

1HPF+  RESET  S 
END  IF 
laMD  DO 
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As  in  Example  S,  when  the  independent  loop  is  encountered  first  during  execution,  the  value  of  S  is 
undefined,  and  the  gather  clause  results  in  the  execution  of  the  inspector  and  the  assignment  of  the 
resulting  schedule  to  S.  On  subsequent  iterations  of  the  outer  loop,  the  value  bound  to  S  is  reused  as 
long  as  the  logical  function  RECONFIGURE  yields  FALSE .  If  the  reference  to  RECONFIGURE 
evaluates  to  TRUE,  the  reset  directive  sets  the  definition  status  of  S  to  undefined,  and  the  inspector 
is  executed  anew  when  the  independent  loop  is  encountered  in  the  next  iteration  of  the  outer  loop.a 

The  previous  examples  have  shown  how  a  single  array  can  be  associated  with  a  schedule  variable. 
Below,  we  generahze  Example  3  by  showing  how  to  associate  the  same  schedule  variable  with 
different  arrays.  If,  in  such  a  situation,  an  inspector  has  to  be  executed,  the  implementation  selects 
one  of  the  arrays  for  performing  the  inspector  analysis.  The  language  does  not  specify  the  selection 
criterion. 

Example  6  Multiple  Use  of  a  Schedule 

IHPF+  SCHEDULE:;  S 


DO  t=l, max-time 

!HPF+  INDEPENDENT,  ON  HOME(C(IX(I))),  SCATTER ( A: :S),  GATHER (B,C::S) 

L:  DO  1=1, N 

A(IX(I))  =  B(IX(I))  +  C(IX(I)) 

END  DO 
END  DO 

This  example  differs  from  Example  3  in  that  cil  three  arrays  are  accessed  irregularlyj  using  ihe 
saT7i>e  indirection  a>rr(iy  IX  and,  thus  the  same  pattern  function^  ho,sed  on  the  assumption  tho,t  all 
three  arrays  are  distributed  identically.  When  execution  encounters  the  independent  loop  firstj  S  is 
undefined.  One  of  the  three  arrays  is  then  selected  for  inspector  analysis,  □ 


Example  7  Using  Multiple  Schedule  Variables  in  a  Loop 

DO  t=l,max-time 

CALL  SUB(X,Y,Z,U,V,IX1,IX2) 

END  DO 

SUBROUTINE  SUB(X,Y,Z,U,V,IX1,IX2) 

!HPF+  SCHEDULE,  SAVE  S1,S2,S3 

!HPF+  INDEPENDENT ,  ON  HOME(Y(I)),  GATHER {X;:S2,Z:;S1,U::S3),  SCATTER  (X::S1) 
L;  DO  I=1,N 

X(IX1(I))  =  X(IX2(I))+Y(I)*Y(I)/(Z(IX1(I))-U(IX1(I), 1X2(1))) 
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END  DO 


END  SUBROUTBNfE  SUB 

In  this  example 

•  SI  is  associated  with  the  ALPs  {L,Z,Tri,R)  and  (L,X,7ri, PU),  where  ni{I)  =  {IX1{I)}  for 
I  =  l,...N. 

m  S2  is  associated  with  the  ALP  {(L,  X, 7r2,i?)},  where  Jr2(/)  =  {IX2{I)}  for  I  =  1, . . .  N. 

•  53  is  associated  with  the  ALP  {{L,U,'jr3,R)},  where  kz{I)  =  {7X1(7), 7X2(7)}  for  I  = 

The  loop  distribution  is  determined  by  the  loop  iteration  range  [1  :  N]  and  the  associated  on 
clause  ON  HOME(Y(I))  (see  Example  3). 

This  example  is  only  correct  if  6^=5^.  □ 

In  the  following,  we  extend  Example  7  by  showing  how  to  apply  a  schedule  computed  in  one  loop 
to  another  loop. 

Example  8  Using  a  Schedule  Across  Different  Loops 
IHPF+  SCHEDULE  ::  S1,S2,S3 
1HPF+  INDEPENDENT,  ON 

HOME(Y(I)),  GATHER  (X;:S2,Z::S1,U::S3),  SCATTER  (X;;S1) 

LI:  DO  I=1,N 

X(IX1(I))  =  X(IX2(I))+Y(I)*Y(I)/(Z(IX1(I))-U(IX1(I), 1X2(1))) 

END  DO 

IHPF+  INDEPENDENT ,  ON  HOME(Y(I)),  GATHER (X,V::S1).  SCATTER (V:;S2) 

L2:  DO  1=1, N 

V(IX2(I))  =  X(IX1(I))+Y(I)*Y(I)*V(IX1(I)) 

END  DO 


Here,  the  values  to  which  SI,  52,  and  53  are  bound  via  the  gather  and  scatter  clauses  associated 
with  LI  are  the  same  as  for  loop  L  in  the  previous  example.  51  and  52  are  reused  in  L2.  When 
L2  is  entered  for  the  first  time,  all  schedule  variables  used  in  this  loop  already  have  a  defined  value, 
and  no  inspector  analysis  is  required.  D- 

By  reusing  a  schedule  variable  in  two  different  loops,  the  user  is  guaranteeing  that  the  core  of 
the  schedule,  i.e.,  the  index  domain,  the  schedule  function  and  the  data  distributions  involved  are 
the  same.  The  compiler  (and  the  runtime  system)  just  reuse  the  schedules  as  specified  without 
checking.  Thus,  the  reuse  of  51  and  52  in  the  above  example  would  not  work  if  the  on  clauses 
specified  in  LI  and  L2  were  different  or  if  the  distributions  of  X  and  V  were  not  the  same. 
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4.4  Schedule  Assignment  Directive 

A  schedule  assignment  directive  has  the  form  S  =  schedule-expression,  where  5  is  a  reference  to 
a  schedule  variable.  Such  an  assignment  is  executed  by  evaluating  the  schedule-expression  and 
binding  its  result  to  S. 

For  the  purpose  of  this  paper,  we  consider  only  two  cases  for  a  schedule  expression:  first,  a 
reference  to  a  schedule  variable,  and  second,  a  union  of  schedules,  which  uses  the  operator  symbol 

a  j  » 

If  the  schedule  expression  is  a  reference  to  a  schedule  variable,  then  the  result  of  evaluating  it 
is  the  value  to  which  the  variable  is  currently  bound. 

If  the  schedule  expression  is  a  union,  51  +  52,  then  assume  first  that  51  and  52  both  are 
defined.  51  and  52  must  be  associated  with  the  same  array  index  domain,  I,  and  the  same  array 
distribution,  6,  resulting  in  respective  cores  (I,J,<ti)  and  (I,5,cr2)-  Then  the  value  yielded  by  the 
expression  is  characterized  by  the  core  (I,<S,£ri  U  ^2),  and  the  set  of  associated  array  schedulas  is 
empty.  If  51  or  52  are  undefined,  or  the  expression  is  not  well-defined,  then  the  schedule  expression 
returns  undefined. 

Example  9  Schedule  Assignment 
IHPFH-  SCHEDULE  S1,S2,S3 

!HPF-t-  INDEPENDENT , ON  HOME(Y(I)), GATHER (X:;S2,Z:;S1), SCATTER (X::S1) 

LI:  DO  I=1,N 

X(IX1(I))  =  X(IX2(I))-H(Z(IX1(I)) 

END  DO 

!HPF-)-  S3=S1+S2  !  Schedule  assignment,  computing  the  union  of  schedules  SI  and  S2 

!HPF-f-  INDEPENDENT, ON  HOME(Y(I)), GATHER (X::S1,V::S3), SCATTER (V;:S2) 

L2;  DO  1=1, N 

V(IX2(I))  =  X(IXl(I))+Y(I)*Y(I)*(V(IXl(I)-t-V(IX2(I))) 

EIND  DO 

The  schedule  variable  53  *s  defined  by  a  schedule  assignment,  which  computes  the  union  of  the 
schedules  associated  with  51  and  52.  53  is  associated  with  the  ALP  (L2|  where  jts  = 

TTi  U  Jr2. 


5  Unstructured  Mesh  Multigrid  Example 

In  this  section  we  present,  in  relative  detail,  a  more  concrete  example,  using  multigrid  techniques 
on  an  unstructured  mesh.  We  show  how  the  features  described  above  can  be  utilized  to  specify  the 
reuse  of  communication  schedules  for  the  multiple  levels  of  the  unstructured  mesh.  Note  that  the 
code  shown  here  is  not  complete  for  the  sake  of  brevity  and  clarity.  In  particular,  we  do  not  show 
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!  Type  Declarations 


TYPE  vert 

INTEGER  id 

INTEGER  bdry_flag 

INTEGER  par_a,par-.b,par-c 

REAL  ca,cb,cc 

REAL  sol 

REAL  old-sol 

REAL  delta 

REAL  res 

REAL  f 

END  TYPE  vert 


!  boundary  flag 

!  vertices  of  parent  cell 
!  interpolation  coefficients 
!  current  solution 
!  last  solution 
!  change  in  solution 
!  residual 
!  forcing  function 


TYPE  edge 

INTEGER  id 

INTEGER  va,  vb  !  vertices 

END  TYPE  edge 


TYPE  grid  type 

INTEGER  nedge 
INTEGER  nvert 

TYPE  (edge),  POINTER,  DIMENSION  (:)  ::  elist 
TYPE  (vert),  POINTER,  DIMENSION  (:)  ::  vlist 
END  TYPE  grid_type 


Figure  1:  Multigrid  on  an  Unstructured  Mesh:  Type  Declarations 

any  of  the  distributions  since  the  discussion  here  Is  independent  of  the  actual  distribution  used  for 
the  data  structures. 

Figure  1  shows  the  global  type  declarations  used  for  an  unstructured  mesh  including  those  for  a 
vertex  containing  indices  of  the  parent  cell  in  a  finer  mesh,  an  edge  with  its  two  vertices,  and  a  grid 
containing  a  list  of  edges  and  and  a  list  of  vertices.  The  main  program  generates  the  unstructured 
meshes,  grids,  at  nZeu  +  l  levels  setting  up  the  interconnections  between  the  meshes  and  initializing 
the  forcing  function  of  the  mesh  at  the  finest  level,  i.e.,  grids{0).  It  repeatedly  calls  the  routine 
cycle  to  execute  one  multigrid  cycle.  The  routine  cycle  (also  shown  in  Figure  2)  conducts  one  V- 
cycle  of  the  multigrid  algorithm.  It  “relaxes”  the  grid  at  each  level  while  calling  the  rest  to  restrict 
the  values  from  a  fine  grid  to  a  coarse  grid.  Then,  it  uses  the  routine  prolo  to  interpolate  values 
from  a  coarse  grid  to  a  finer  grid.  The  code  until  this  point  does  not  require  any  communication, 
except  possible  synchronization  required  to  initially  generate  an  unstructured  mesh  in  parallel. 

The  routine  relax,  shown  in  Figure  3  performs  a  relaxation  on  an  unstructured  mesh  by  sweeping 
over  the  edges  of  the  mesh  and  updating  the  values  at  the  vertices  of  each  edge  based  on  the  old 
values  at  the  vertices.  The  loop  is  declared  to  be  independent  and  each  iteration  is  to  be  executed 
on  the  processor  which  owns  the  edge  being  computed  upon.  This  could  require  gathering  up  the 
values  at  vertices  which  do  not  reside  on  the  same  processor  as  the  edge,  computing  the  new  values 
and  then  scattering  the  values  to  the  owning  processors.  Since  each  call  to  relax  works  on  a  mesh 
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at  a  different  level,  an  HPF  compiler  would  have  to  r^enerate  the  schedule  each  time  unless  it 
could  perform  interprocedural  analysis  to  determine  that  there  are  nlev  +  1  different  meshes  each 
requiring  a  different  schedule.  Such  analysis  is  complex  and  is  currently  not  avaUable  in  any  HPF 
compiler. 

Using  the  features  described  in  the  last  section,  the  user  can  indicate  the  required  number  of 
schedules  and  when  to  reuse  them.  Thus,  two  arrays  of  nlev  +  1  schedules  are  declared  in  the 
specification  part.  On  each  call  to  relax,  the  current  level  Iv  is  also  passed  in  and  is  used  to  index 
the  appropriate  schedule  for  gathering  values  from  the  elist  and  vlist  and  scattering  data  to  the 
vlist  At  each  mesh  level,  the  first  call  to  relax  would  encounter  an  undefined  schedule  and  hence 
the  inspector  would  be  called  to  generate  the  schedule.  Since  the  schedules  are  declared  with  save 
attribute,  on  subsequent  cycles  of  the  multigrid,  the  old  schedule  at  each  level  can  be  reused  without 
executing  the  inspector.  Again,  if  schedules  were  first  class  objects  in  the  language,  they  could  be 
declared  at  the  top,  e.g.,  in  the  gridJype  itself,  and  then  passed  down  the  call  chain. 

Figure  4  shows  the  prolongation  routine,  prolo  and  the  restriction  routine,  rest.  These  are 
similar  to  relax  except  they  “translate”  values  between  the  vertex  lists  of  meshes  at  two  levels. 
The  code  structure  is  similar  to  the  routine  relax  and  similar  declarations  of  schedule  arrays  can  be 
used  to  specify  the  reuse  of  the  schedules.  The  only  difference  is  that  we  need  only  nlev  schedules 
here  since  each  schedule  handles  the  data  transfer  between  two  levels.  Again,  without  the  schedule 
specification  the  compiler  would  either  have  to  regenerate  the  schedules  on  each  call  or  carry  out 
complex  inter-procediiral  analysis  to  determine  the  right  schedule  to  use  on  each  call. 


6  Explicit  Pattern  Specification 

Until  now,  we  assumed  that  a  schedule  is  originally  computed  by  an  inspector,  which  preprocesses 
the  loop  at  runtime  by  analyzing  its  array  access  patterns.  Based  on  the  access  pattern  for  an 
array,  the  inspector  can  determine  the  associated  array  schedule  by  taking  into  account  the  work 
distribution  of  the  loop  and  the  data  distribution  for  the  array. 

Although,  in  principle,  this  approach  is  always  possible,  inspector  analysis  may  become  very 
costly  if  nested  loops  and  procedure  calls  occur  in  the  parallel  loop.  In  this  section,  we  discuss  lan¬ 
guage  features  for  explicit  pattern  specification  to  support  the  user  in  a  situation  where  inspector 
computation  is  impractical  or  highly  inefficient  and  the  user  knows  the  access  patterns  used  in  the 
application. 

The  evaluation  of  a  pattern  specification  results  in  a  pattern  function  (see  Definition  4).  In  the 
context  of  a  parallel  loop,  a  pattern  function  can  be  combined  with  an  array  and  a  direction  to 
form  an  ALP,  from  which  a  schedule  can  be  determined. 

We  are  currently  studying  how  to  extend  the  simple  patterns,  which  can  be  specified  based 
upon  Fortran  90  array  reference  syntax,  to  provide  a  more  ^neral  specification  capability  in  a  way 
similar  to  Vienna  Fortran’s  user-defined  distributions  [1,  19,  4]. 

6.1  Simple  Patterns 

A  simple  pattern  can  be  specified  in  the  context  of  a  gather  or  scatter  clause  associated  with  a 
parallel  loop  L.  We  extend  the  syntax  of  these  clauses,  as  introduced  in  the  last  section,  by  allowing 
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a  pattern,  specification  in  the  place  of  the  schedule  variable,  possibly  combined  with  an  assignment 
to  the  schedule  variable.  We  illustrate  this  using  the  gather  clause. 

GATHER  (>li, . . . ,  yl„  ::  [S  =]  pattern-spec) 

The  pattern-spec  has  the  form 

PATTERN  {jpattem-element,... pattern-element) 

where  each  pattern-element  is  a  parenthesized  list  of  section-subscripts,  one  for  each  dimension  of 
the  arrays  A,,  all  of  which  must  have  the  same  rank.  At  least  one  section  subscript  must  contain 
a  reference  to  a  do  variable  of  L.  Vector  subscripts  occurring  in  patterns  are  not  restricted  to 
one-dimensional  arrays,  as  in  Fortran  90. 

Let  R  denote  the  iteration  domain  of  L,  and  I  the  common  index  domain  of  the  Ai.  Then  a 
pattern-spec  is  evaluated  in  the  following  way: 

1.  Evaluate  the  pattern  elements,  yielding  pattern  functions  tti,  . , .  all  of  which  map  R  to 
the  powerset  of  I. 

2.  Define  the  pattern  function,  jt,  for  the  pattern-spec  as  the  union  n  := 

3.  Bind  tt  to  all  this  produces  a  set  of  equivalent  ALPs:  {(L,  Ai,7r,  R), ...  ,{L,  A„,7r,  J2)}. 

4.  If  a  schedule  variable,  5,  has  been  specified,  then  bind  S  to  the  set  of  ALPs  determined  above. 

The  pattern  function  for  a  pattern  element  is  evaluated  by  determining  the  values  of  all  vari¬ 
ables  occurring  in  the  section  subscripts,  which  are  not  do  variables  of  L  and  replacing  them  by 
their  values. 


Ebcample  10  Simple  Pattern  I 

SUBROUTINE  SWEEP  (X,Y, EDGE) 

IHPF-I-  SCHEDULE,  SAVE  ::  S 

IHPF-I-  INDEPENDENT,  ON  HOME(EDGE(I,l)),  REDUCTION (Y),  & 
IHPF+  GATHER (X,Y::S=  PATTERN ((EDGE(1, 1:2)))),  SCATTER(Y::S) 
L:  DO  I=1,NEDGE 

N1=EDGE(I,1);  N2=EDGE(I,2);  DELTAX  =  F(X(N1),X(N2)) 

Y(N1)  =  Y(N1)  -  DELTAX 
Y(N2)  =  Y(N2)  +  DELTAX 

END  DO 


END  SUBROUTINE  SWEEP 
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The  simple  pattern  specification  in  this  example  consists  of  one  pattern  element,  which  in  turn 
is  a  list  with  one  section  subscript,  EDGE((I,1:2).  This  determines  a  pattern  function,  n,  with 
7r(/)  =  {EDGE{I,1),  EDGE{I,2)}  for  all  I  =  1,...,NEDGE.  The  associated  set  of  ALP s  is 
{(L, X,  TT,  R),  (L,  y, x,  R),  (L,  y, TT,  W)} . 

This  computation  is  only  performed  upon  Hie  first  execution  of  L.  All  subsequent  executions 
reuse  this  schedule.  No  inspector  analysis  has  to  be  performed.U 

The  example  below  illustrates  a  pattern  consisting  of  more  than  one  pattern  element.  It  ex¬ 
presses  the  patterns  of  the  second  loop  in  Example  8  explicitly,  without  using  schedule  variables. 

Example  11  Simple  Pattern  II 

DO  t=l,max  time 

CALL  SUB2(X,Y,Z,U,V,IX1,IX2) 

END  DO 

SUBROUTINE  SUB2(X,Y,Z,U,V,IX1,IX2) 

!HPF+  INDEPENDENT , ON  HOME(Y(I)),GATHEai(X:;PATTERN  ((IXl(I))),  & 

!HPF+  V:;  PATTERN  ((IXl(I),  (1X2(1)))),  SCATTER (V:;  PATTERN  ((1X2(1)))) 

L2:  DO  I=1,N 

V(IX2(I))  =  X(IX1(I))+Y(I)*Y(I)*(V(IX1(I)+V(IX2(I))) 

END  DO 

END  SUBROUTINE  SUB2  C 

7  Implementation 

In  this  section  we  describe  the  language  features  and  compilation  techniques  provided  by  the  Vienna 
Fortran  Compilation  System  (VFCS  [2])  for  the  parallelization  of  irregular  applications.  We  outline 
the  main  steps  in  transforming  independent  do  loops  with  irregular  data  accesses  according  to  the 
inspector-executor  strategy  [17,  13,  16]  and  present  performance  results  for  a  benchmark  kernel 
which  has  been  extracted  from  a  crash-simulation  code.  A  hand-optimized  version  of  the  parallel 
code  shows  that  communication  schedule  reuse  is  crucial  to  amortize  the  overhead  of  this  run-time 
parallelization  strategy. 

7.1  Inspector-Executor  Parallelization  Strategy 

VFCS  supports  the  HPF-2  INDEPENDENT  directive  together  with  the  NEW  and  REDUCTION  clauses. 
Moreover,  the  user  may  specify  the  mapping  of  loop  iterations  to  processors  by  means  of  the  ON  and 
ON  HOME  clauses.  Arrays  accessed  within  independent  loops  may  be  distributed  using  the  BLOCK, 
GEN_BL0CK  or  INDIRECT  distribution  formats.  Indirection  arrays  may  be  distributed  and  arbitrarily 
nested. 
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The  process  of  parallelizing  irregular  loops  combines  compile  time  parallelization  techniques  with 
runtime  analysis.  Information  from  data-flow  analysis,  data  dependence  analysis,  data  distribution 
and  overlap  analysis  is  utilized.  The  runtime  support  of  VFCS  for  irregular  loops  is  based  on  an 
enhanced  version  of  the  PARTI  ([6])  and  CHAOS  ([18])  runtime  routines. 

VFCS  transforms  irregular  independent  loops  into  three  main  phases:  the  work  distributor,  the 
inspector^  and  the  executor. 

According  to  the  on-clause  of  an  independent  loop  L  the  work  distributor  calculates  on  each 
processor  p  the  set  x^(p)  of  loop  iterations  to  be  executed.  Depending  on  its  structure,  the  execution 
set  is  either  represented  as  a  regular  section  by  a  triplet  or  as  a  one-dimensional  integer  array. 

The  inspector  performs  a  runtime  analysis  of  the  loop  in  order  to  determine  for  each  distributed 
array  the  required  communication  schedules.  This  is  accomplished  by  iterating  on  each  processor 
over  the  computed  execution  set,  evaluating  the  subscripts  of  all  distributed  arrays,  and  storing 
the  indices  in  global  reference  lists.  These  lists  are  passed  together  with  the  corresponding  lay¬ 
out  descriptors  to  library  procedures  that  return  the  resulting  gather /scatter  schedules  and  the 
corresponding  local  reference  lists.  In  the  case  of  a  multi-level  indirection,  the  array  accesses  are 
processed  in  multiple  phases,  starting  with  the  innermost  arrays. 

For  each  nonlocal  data  item  accessed  in  a  processor,  a  local  buffer  element  is  reserved.  The 
executor  phase  begins  by  gathering  all  nonlocal  read  data  from  remote  processors,  according  to  the 
gather  schedules  determined  by  the  inspector,  and  placing  them  into  the  associated  buffer  elements. 
Following  this  transfer,  the  transformed  loop  body  is  executed;  nonlocal  data  are  read  and  written 
using  their  associated  buffer  elements.  After  all  iterations  have  been  performed,  all  nonlocal  data 
that  were  written  in  a  processor  are  scattered  to  their  owner  using  the  scatter  schedules  determined 
by  the  inspector. 

7.2  Performance  Results 

We  present  performance  results  for  a  finite-element  kernel  that  has  been  developed  in  the  context 
of  the  HPF-I-  project.  The  kernel  represents  the  basic  stress-strain  calculation  of  a  crash-simulation 
code  based  on  4-node  shell  elements.  It  uses  an  explicit  time-marching  scheme  which  is  represented 
in  the  kernel  by  an  outer  loop  performing  250  iterations.  A  simplified  structure  of  the  kernel  is 
shown  in  Figure  5. 

From  the  time-step  loop  in  the  main  program  the  subroutine  KFORCE  is  called  with  distributed 
arguments.  The  corresponding  actual  arguments  inherit  the  mapping  and  thus  no  data  motion  is 
required  at  the  procedure  boundary.  The  main  variables  used  in  the  kernel  (F,  A,  and  V)  represent 
the  forces,  accelerations,  and  velocities  at  nodal  points,  respectively.  X  represents  the  coordinates 
of  nodal  points,  and  IX  captures  the  connectivity  of  the  elements  in  the  unstructured  mesh. 

The  first  independent  loop  in  KFORCE  represents  the  element-wise  force  calculation.  Before  the 
call  to  MFORCE  all  required  data  is  gathered  into  private  temporary  variables  which  are  declared 
as  NEW  in  the  INDEPENDENT  directive.  The  communication  required  for  gathering  distributed  data 
into  the  temporaries  is  determined  at  runtime  by  means  of  inspectors.  The  major  part  of  the 
computational  cost  of  the  algorithm  is  within  the  procedure  MFORCE  which  operates  on  processor- 
private  data  only,  and  thus  does  not  induce  communication.  The  second  independent  loop  performs 
a  sum-scatter  operation  to  add  back  the  elemental  forces  to  the  forces  stored  at  the  nodes. 

The  kernel  used  in  our  evaluation  employed  an  unstructured  mesh  with  35571  nodes  and  35000 
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Processors 

Total  (Unoptimized) 

Inspector 

Gather 

Scatter 

Speed  Up 

1  (seq.) 

545.45 

- 

- 

- 

- 

2 

598.98 

347.63  (58%) 

2.31 

2.38 

0.9 

4 

304.75 

172.48  (57%) 

4.33 

6.22 

1.8 

8 

160.19 

89.60  (56%) 

5.33 

7.81 

3.4 

16 

100.04 

50.66  (51%) 

8.68 

10.78 

5.5 

32 

80.99 

36.34  (45%) 

10.78 

11.65 

6.7 

Table  1:  Times  for  crash  kernel  without  schedule  reuse. 


Processors 

Total  (Schedule  Reuse) 

Inspector 

Speed  Up 

Unoptimized/Schedule  Reuse 

2 

285.22 

1.47 

1.9 

2.1 

4 

153.00 

0.81 

3.6 

2.0 

8 

79.10 

0.47 

6.9 

2.0 

16 

47.47 

11.5 

2.1 

32 

35.31 

15.4 

2.3 

Table  2:  Times  for  crash  kernel  with  schedule  reuse  (hand  optimized). 

elements*  A  total  of  20  distributed  arrays  were  used  for  which  VFCS  generated  9  different  inspector 
phases  within  the  KFORCE  procedure.  Table  1  shows  the  time  measurements  obtained  for  the  un¬ 
optimized  kernel  as  parallelized  by  VFCS  and  executed  on  the  Meiko  CS-2  for  2  to  32  processors^ 
The  second  column  of  the  table  shows  the  total  time  spent  in  executing  the  procedure  KFORCE.  As 
can  be  seen,  approximately  50  %  of  the  total  execution  time  is  due  to  the  overhead  of  the  inspector 
phases  for  computing  the  required  communication  schedules.  Thase  schedules  are  recomputed  in 
every  incarnation  of  the  procedure  since  the  system  fails  to  detect  the  invariance  of  the  commu¬ 
nication  patterns.  Note  that  in  a  real  simulation  the  overhead  induced  by  the  repeated  execution 
of  inspectors  would  be  orders  of  magnitude  higher  since  the  number  of  time-steps  is  usually  in  the 
range  of  10^ . 

Table  1  also  shows  the  accumulated  time  spent  with  gather  and  scatter  communications,  re¬ 
spectively.  Since  the  communication  is  unstructured  and  involves  all  processors,  the  fraction  of 
time  spent  with  communication  increases  almost  linearly  with  the  number  of  processors  from  less 
than  1%  on  2  processors  to  28%  on  32  processors. 

Table  2  shows  the  total  time  spent  in  procedure  KFORCE  where  all  communication  schedules 
were  computed  in  the  first  iteration  of  the  time^step  loop  and  reused  in  subsequent  iterations. 
This  optimization  has  been  carried  out  by  manually  adapting  the  code  generated  by  VFCS.  As  the 
timings  show,  the  overhead  of  the  time  spent  in  computing  communication  schedules  is  reduced  to 
less  than  1%  and  results  in  a  performance  improvement  of  more  than  a  factor  of  2. 

^The  time  for  one  processor  refers  to  the  HPF+  program  compiled  with  the  SUN  FORTRAN  77  compiler  Version 
3.0.1  with  optimization  level  -03  and  execiited  on  a  single  node. 
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8  Related  Work 


The  inspector/executor  runtime  preprocessing  technique  was  initially  implemented  in  the  Kali  [13] 
compiler.  The  PARTI  [6]  runtime  library  was  developed  to  provide  runtime  support  for  a  class  of 
irregular  problems  characterized  by  a  sequence  of  concurrent  computational  phases,  where  patterns 
of  data  access  and  computational  cost  of  each  phase  cannot  be  predicted  until  runtime.  They  were 
designed  to  ease  the  implementation  of  irregular  problems  on  distributed  memory  parallel  archi¬ 
tectures  by  relieving  the  user  of  having  to  deal  with  many  low-level  machine  specific  issues.  The 
PARTI  routines  support  communication  schedule  computation,  global-to-local  index  transforma¬ 
tion  and  schedule-based  communication  generation.  The  CHAOS  library  [16,  18],  a  superset  of 
PARTI,  provides  additional  support  for  the  parallelization  of  adaptive  irregular  problems  where  in¬ 
direction  arrays  are  modified  during  the  course  of  the  computation.  A  number  of  research  compilers, 
including  the  Fortran  D  compiler  [9],  the  Vienna  Fortran  Compilation  System  [2],  and  others  [3] 
have  used  these  libraries  for  the  compilation  of  irregular  codes  based  on  the  inspector/executor 
approach. 

In  [16]  a  simple  run-time  technique  for  communication  schedule  reuse  is  presented  which  is 
based  on  a  global  time-stamping  method  to  keep  track  of  whether  indirection  arrays  that  influence 
the  communication  pattern  of  a  particular  loop  have  been  modified.  Each  inspector  checks  the 
corresponding  time-stamps  to  determine  whether  relevant  indirection  arrays  have  been  modified 
since  the  last  inspector  invocation. 

Hanxleden  [10]  developed  the  Give-N-Take  data  flow  framework  which  is  used  for  the  placement 
of  communication  statements  in  parallelized  programs.  He  uses  techniques  that  are  based  on 
partial  redundancy  elimination  and  symbolic  analysis  to  eliminate  redundant  inspectors  in  certain 
restricted  cases. 

First  proposals  for  applying  program  slicing  techniques  to  the  optimization  of  indirect  array 
accesses  were  made  in  [7,  8].  These  techniques  construct  program  slices  containing  the  subset  of 
statements  affecting  nonlocal  array  accesses  at  a  particular  program  point.  Multiple  indirection 
levels  can  be  eliminated  by  applying  a  flattening  transformation.  A  dedicated  program  analysis 
based  on  the  topological  sort  of  a  slice  graph  is  employed  for  the  elimination  of  redundant  slices. 
Most  of  these  techniques,  however,  are  restricted  to  certain  limited  forms  of  loop  bodies  and  fail  in 
the  presence  of  procedure  calls. 

The  PILAR  [14]  run-time  support  library  developed  in  the  context  of  the  PARADIGM  com¬ 
piler  [15]  aims  at  exploiting  regularity  in  irregular  accesses  by  using  an  interval-based  representation 
of  communication  schedules.  The  PARADIGM  system  relies  on  special  directives  to  guide  the  com¬ 
piler  in  placing  the  communication  in  the  presence  of  irregular  references. 

9  Conclusion 

In  this  paper,  we  described  a  set  of  language  features  that  allow  the  explicit  manipulation  of  com¬ 
munication  schedules  at  the  HPF  language  level.  Our  method  is  based  on  the  concept  of  schedule 
variables,  whose  values  are  communication  schedules  computed  by  an  inspector  or,  alternatively, 

defined  by  the  user  by  specifying  the  access  pattern  to  an  array  with  appropriate  high-level  direc¬ 
tives. 
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These  features  are  currently  being  implemented  in  the  framework  of  the  Vienna  Fortran  Com¬ 
pilation  System  (VFCS).  We  plan  to  apply  this  methodology  to  a  set  of  important  applications, 
evaluate  the  resulting  performance,  and  use  the  results  to  adjust  the  functionality  of  the  language 
extensions  accordingly. 
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PROGRAM  main 
INTEGER  nlev 

TYPE  (grid-type),  ALLOC ATABLE::  gTids(:) 
REAL  err 


!  allocates  grids (0:nleAhl)  and  the  edge  list,  elist,  and  the  vertex,  list,  vlist, 

!  at  each  level  based  on  input  values  and  sets  up  parent-child  and  neighbor  interconnections 
CALL  setnp(grids,  nlev) 

!  set  fine  grid  forcing  Junction 
CALL  fine_set(grids(0)% vlist) 

DO  WHILE  (err  .GE,  LOE-5) 

CALL  cycle(grids,  nlev) 
err  =  sqnorm  (grids  (0)%  vlist) 

END  DO 

END 

SUBROUTINE  cycle(grids,  nlev) 

TYPE  (grid-type)  ::  grids(:) 

INTEGER  nlev 
INTEGER  Iv,  Ivc 

I  relax  each  level,  and  perform  restriction,  moving  to  coarsest  grid 
DO  Iv  =  0,  nlev-1 
Ivc  =  Iv  +1 

CALL  relax(grids(lv)%vlist,  grids (lv)%nvert,  grids (lv)%elist,  grids(lv)%nedge,  Iv,  nlev) 
CALL  rest  (grids  (lv)%  vlist,  grids  (lv)%nvert,  grids  (Ivc)  %vlist,  grids  (Ivc)  %nvert,  Iv,  nlev) 

END  DO 

!  relax  each  level,  and  perform  prolongations,  moving  to  finest  grid 
DO  Iv  =  nlev-1,  0,  -1 
Ivc  =  Iv  +1 

CALL  relax(grids(lvc)%vlLst,  grids  (Ivc)  %nvert,  grids(lvc)%elist,  grids  (Ivc)  %nedge,  Iv,  nlev) 
CALL  prolo(grids(lv)%vlist,  grids(lv)%nvert,  grids(lvc)%vlist,  gTids(lvc)%nvert,  Iv,  nlev) 

END  DO 
END 


Figure  2:  Multigrid  on  an  Unstructured  Mesh:  Main  Program  and  Cycle  Routine 
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SUBROUTINE  relax(vlist,  nvert,  elist,  nedge,  Iv,  nlev) 

TYPE  (edge)  ::  elist  (nedge) 

TYPE  (vert)  ::  vlist  (nvert) 

INTEGER  nvert,  nedge 

INTEGER  Iv,  nlev 

INTEGER  ia,  ib 
REAL  om,  fix 

!HPF-h  SCHEDULE,  SAVE  ::  s(0:nlev-l),  g(0:nlev-l) 

!  initialize  residual  to  forcAng  funcMon 
vlist(:)%res  =  vlist(:)%f 

!  loop  over  edges  on  this  level  accumulating  residual  unto  vertices 

!HPF$  INDEPENDENT,  ON  (HOME (elist (e)),  NEW (ia,ib,ax),  REDUCTION  (vlist) 
!HPF+  GATHER(elist::s(lv)),  GATHER  (vlist: :g(lv)),  SCATTER  (vlist: :g(lv)) 

DO  e  =  1,  nedge 
ia  =  elist  (e)%va 
ib  =  elist  (e)%vb 
fix  =  flux(vIiBt(ia),  vlist  (ib)) 

vlist  (ia)%res  =  vlist  (ia)%res  fix 
vbst(ib)%res  =  vlist(ib)%res  -  fix 

END  DO 

!  loop  over  vertices  on  this  level  ^^relaxing^^  solution 
vlist(:)%sol  =  vlist(:)%sol  +  om  *  vlist(:)%res 

!  update  boundaries 

CALL  apply-bc( vlist,  nvert) 

END 


Figure  3:  Multigrid  on  an  Unstructured  Mesh:  Relaxation  Routine 
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SUBROUTINE  reat(fiiie_vlifit,  fiiie_nvert,  vlist,  nvert,  Iv,  nlev) 

TYPE  (vert)  ::  vliflt (nvert),  fine_vbst(ftiiejivert) 

INTEGER  Iv,  nlev,  nvert,  fme_nvert 

TYPE  (vert)  ::  va,  vb,  vc 

!HPF4-  schedule,  SAVE  ::  s(0:nlev-l),  g(0:nlev-l) 

!  zero  coarse  grid  right  forcing  function 
vlist(:)%f  =  0.0 
!  loop  over  fine  grid  vertices 

!HPF$  INDEPENDENT,ON(HOME(fliie-vlist(v)), NEW (va,vb.vc), REDUCTION (vlist(;)  %res) 
!HPF+  GATHER  (fine.vlist::s(lv)),  GATHER  (vlist :;g(lv)),  SCATTER  (vlist::g(lv)) 

DO  V  =  1,  fine_nvert 

va  =  fine_vlist(v)%par_a;  vb  =  fine_vlist(v)%par_b;  vc  =  fine.vlist(v)%par_c 
!  accumulate  residual  at  a  fine  grid  vertex  unto  its  coarse  grid  parent  vertices 
vlist(va)%res  =  vlist(va)%res  +  fine  vlist(v)%ca  *  fine  vlist(v)%res 
vliflt(vb)%refl  =  vligt(vb)%res  +  fine_vliBt(v)%cb  *  fine_vlist(v)%res 
vlist(vc)%res  =  vlist(vc)%res  +  fine-vlist(v)%cc  *  fine-vlist(v)%res 
END  DO 
END 

SUBROUTINE  prolo(fine_vlist,  fine_nvert,  vlist,  nvert,  Iv,  nlev) 

TYPE  (vert)  ::  vlist  (nvert),  fine_vli8t(fine_nvert) 

INTEGER  Iv,  nlev,  nvert,  fine_nvert 

TYPE  (vert)  ::  va,  vb,  vc 

IHPF+  SCHEDULE,  SAVE  ::  s(0:nlev-l),  g(0:nlev-l) 

!  loop  over  fine  grid  vertices 

!HPF$ 

INDEPENDENT ,  ON  (HOME(fiiie.vUst(v)),  NEW  (va,vb,vc),  REDUCTION  (fine.vUst(:)%sol) 
!HPF+  GATHER  (vlist::s(lv)),  GATHER(&ie-vlist::g(lv)),  SCATTER  (flne.vlist::g(lv)) 

DO  V  =  1,  fine-nvert 

va  =  fine_vlist(v)%par_a;  vb  =  fine_vlist(v)%par.b;  vc  =  fine_vlist(v)%par.c 
!  linearly  interpolate  values  in  vertices  va,vbjVc  and  update  v 

fine. vlist (v)%aol  =  flne_vliBt(v)%Bol  +  fine_vlist(v)%ca  *  vlist (va)% delta  H-  & 

fine-vlist(v)%cb  *  vlist (vb)%delta  +  flne-vlist(v)%cc  *  vlist(vc)%delta  +  & 

v,ca*va.delta  +  v.cb*vb.delta  +  vxc*vc.delta 

END  DO 
END 


Figure  4;  Multigrid  on  an  Unstructured  Mesh;  Routines  for  Restriction  and  Prolongation 
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PROGRAM 


REAL,  DI]V'IENSION(3,NUMNP)  ::  X 
REAL,  DIMENSI0N(4,NUMEL)  IX 
REAL,  DIMENSION(6,NUMNP)  F,  A,  V 

REAL,  DIMENSI0N(6,NUMEL)  FORCEl,  FORCE2,  FORCES,  FORCE4 
!HPF$  DISTRIBUTE  (*, BLOCK)  ::  X,  DC,  F,  A,  V,  FORCEl, ... 


.DO  T=1,MAX.TIME 

CALL  KFORCE(F,X,IX,...) 

END  DO 
END 


SUBROUTINE  KFORCE(f,x,ix,...) 

!HPF$  INHERIT  ::  F,  X,  IX,  ... 

REAL  XN(3,4),  FORCE(6,4) 

!HPF+  INDEPENDENT,  NEW(XN,FORCE,...),  ON  HOME(IX(l,I)) 
DO  I  =  1,  NUMEL 

XN  =  X(:,IX(:,I))  !  gather  coordinates 

CALL  MFORCE(XN,  FORCE,  ...)  !  calculate  forces 


FORCEl  (:,i)  =  FORCE(:,l) 

END  DO 

!HPF$  INDEPENDENT,  REDUCTION (F) 

DO  1=1,  NUMEL  !  sum-scatter  reduction 

DO  J  =  1,  6 

F(J,IX(1,I))  =  F(J,IX(1,I))  +  F0RCE1(J,I) 

END  DO 
END  DO 

END  SUBROUTINE 


Figure  5:  Simplified  structure  of  the  finite-element  crash  simulation  kerneL 
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